fix(infra): Resolve Race Condition in Parallel Base Image Builds #14189

Merged
hunsche merged 9 commits into master from fix/base-image-race-condition
Oct 28, 2025
Contributor

@hunsche hunsche commented Oct 27, 2025

Summary

This PR fixes a critical race condition in the base image build process that caused the gcr.io/oss-fuzz-base/base-builder:ubuntu-24-04 image to be incorrectly built with an Ubuntu 20.04 base.

The fix ensures build steps are executed in the correct order by explicitly defining a dependency graph, guaranteeing that versioned images are always built on top of their corresponding, freshly-built base layers.

The Problem

A report indicated that the base-builder:ubuntu-24-04 image contained Ubuntu 20.04. An initial investigation confirmed this behavior.

Investigation Steps

  1. Dockerfile Verification: The entire dependency chain of Dockerfiles was inspected:

    • base-builder:ubuntu-24-04 correctly used FROM base-clang:ubuntu-24-04.
    • base-clang:ubuntu-24-04 correctly used FROM base-image:ubuntu-24-04.
    • base-image:ubuntu-24-04 correctly used FROM ubuntu:24.04.
      This ruled out any static configuration errors in the Dockerfiles themselves.
  2. Build Process Analysis: A dry-run of the infra/build/functions/base_images.py script revealed that all build steps for the different base images were being generated to run in parallel in Google Cloud Build.

Root Cause: Race Condition

The parallel execution was the source of the problem. Because the builds for base-image, base-clang, and base-builder were triggered simultaneously, a race condition occurred:

  • The base-builder:ubuntu-24-04 build would start.
  • It would immediately try to pull its base image, gcr.io/oss-fuzz-base/base-clang:ubuntu-24-04.
  • However, the build for the new base-clang:ubuntu-24-04 had not yet finished.
  • The build process would then fall back to using the existing image with that tag in the container registry, which was an older, incorrectly built version based on Ubuntu 20.04.

The same issue was happening between base-clang and base-image.
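The effect of the race can be illustrated with a small, deterministic model. This is only a sketch under simplifying assumptions: the registry is a plain dictionary, each build resolves its parent tag at the moment it starts, and "starting before the parent has finished" is modelled by simply running the child first. The function names `build` and `run` are hypothetical and do not come from the OSS-Fuzz code.

```python
# Toy model of the race (not OSS-Fuzz code): a build inherits whatever OS the
# parent tag in the registry points to at the moment the build starts.
TAG = 'ubuntu-24-04'

# Parent of each image, following the FROM chain described above.
PARENTS = {
    'base-image': None,          # FROM ubuntu:24.04
    'base-clang': 'base-image',
    'base-builder': 'base-clang',
}


def build(image, registry):
    """Builds `image` on top of the parent currently stored in `registry`."""
    parent = PARENTS[image]
    if parent is None:
        os_release = 'Ubuntu 24.04'
    else:
        os_release = registry[f'{parent}:{TAG}']
    registry[f'{image}:{TAG}'] = os_release


def run(order):
    """Runs builds in `order`, starting from a registry with stale tags."""
    registry = {
        f'base-image:{TAG}': 'Ubuntu 20.04',  # stale, incorrectly built
        f'base-clang:{TAG}': 'Ubuntu 20.04',  # stale, incorrectly built
    }
    for image in order:
        build(image, registry)
    return registry[f'base-builder:{TAG}']


# Child resolves its parent before the parent is rebuilt: stale base wins.
print(run(['base-builder', 'base-clang', 'base-image']))
# Parents are rebuilt first: the fresh base is used.
print(run(['base-image', 'base-clang', 'base-builder']))
```

In this model, only the ordering changes between the two calls, which is the point of the fix described in the next section.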

The Solution

To resolve this, we now enforce a sequential build order that respects the image dependency hierarchy.

  1. Dependency Map: An IMAGE_DEPENDENCIES dictionary was introduced in infra/build/functions/base_images.py to define the explicit build order (e.g., base-clang depends on base-image).

  2. Sequential Build Steps: The get_base_image_steps function was updated to read this map and inject a waitFor clause into each Google Cloud Build step. This forces GCB to wait for a dependency to finish building before starting the next step in the chain.
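A minimal sketch of how such a map could be turned into `waitFor` clauses is shown below. It is an illustration, not the code from this PR: the contents of `IMAGE_DEPENDENCIES`, the step ID format, the `docker build` arguments, and the helper names `step_id` and `sketch_base_image_steps` are all assumptions. Only the `id` and `waitFor` step fields (and the `'-'` value meaning "start immediately") are standard Cloud Build configuration.

```python
# Assumed shape of the dependency map; the real one in base_images.py may differ.
IMAGE_DEPENDENCIES = {
    'base-clang': ['base-image'],
    'base-builder': ['base-clang'],
}


def step_id(image, version_tag):
    """Hypothetical step ID format used to link steps together."""
    return f'build-{image}-{version_tag}'


def sketch_base_image_steps(images, version_tag):
    """Returns Cloud Build step dicts with waitFor derived from the map."""
    steps = []
    for image in images:
        # Only wait for dependencies that are part of this build.
        deps = [d for d in IMAGE_DEPENDENCIES.get(image, []) if d in images]
        steps.append({
            'id': step_id(image, version_tag),
            'name': 'gcr.io/cloud-builders/docker',
            'args': [
                'build', '-t',
                f'gcr.io/oss-fuzz-base/{image}:{version_tag}', '.'
            ],
            # '-' tells Cloud Build to start the step immediately.
            'waitFor': [step_id(d, version_tag) for d in deps] or ['-'],
        })
    return steps


for step in sketch_base_image_steps(
        ['base-image', 'base-clang', 'base-builder'], 'ubuntu-24-04'):
    print(step['id'], '<-', step['waitFor'])
```

Because the step IDs include the version tag, the chains for different Ubuntu versions stay independent of each other in this sketch, so they could still run in parallel.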

Verification

A dry-run was executed after the fix, and the generated build steps now correctly reflect the sequential dependency order. A full build was also triggered, confirming that the fix works in a real environment and produces the correct image.

This change ensures the integrity and correctness of our base images without sacrificing the parallelism between different Ubuntu version builds (e.g., the ubuntu-20-04 and ubuntu-24-04 builds still run in parallel with each other).

The parallel execution of base image builds was causing a race condition,
where dependent images (e.g., `base-builder`) were starting their build
process before their base images (e.g., `base-clang`) had finished
building in the same pipeline.

This resulted in the builder pulling the previously existing, and incorrect,
base image from the registry (e.g., an Ubuntu 20.04-based image instead of
the new 24.04 version).

This commit fixes the issue by introducing an `IMAGE_DEPENDENCIES` map
that explicitly defines the build order. The `get_base_image_steps`
function now uses this map to add `waitFor` clauses to the Google Cloud
Build steps, ensuring that images are built sequentially according to their
dependency graph.
@hunsche
Contributor Author

hunsche commented Oct 27, 2025

/gcbrun trial_build.py zlib --fuzzing-engines libfuzzer --sanitizers address

@hunsche
Contributor Author

hunsche commented Oct 27, 2025

/gcbrun trial_build.py zlib --fuzzing-engines libfuzzer --sanitizers address

@hunsche
Contributor Author

hunsche commented Oct 27, 2025

/gcbrun skip

hunsche and others added 4 commits October 27, 2025 19:15
This reverts the change that added libssl-dev to the ubuntu-24-04 base-runner image. This dependency was found to be unnecessary as the build works without it, as seen in trial builds.
@hunsche
Contributor Author

hunsche commented Oct 28, 2025

/gcbrun skip

@hunsche hunsche enabled auto-merge (squash) October 28, 2025 14:34
@hunsche
Contributor Author

hunsche commented Oct 28, 2025

/gcbrun skip

@hunsche hunsche merged commit a5b601d into master Oct 28, 2025
19 checks passed
@hunsche hunsche deleted the fix/base-image-race-condition branch October 28, 2025 15:28
@evverx
Contributor

evverx commented Oct 28, 2025

FWIW, with this PR merged, #14157 appears to have been fixed. The latest 24-04 base-builder image comes with Ubuntu 24.04.

@DavidKorczynski
Collaborator

DavidKorczynski commented Nov 26, 2025

I think this caused a regression for trial runs where we want all projects to be analysed.

This comment is meant to trigger a trial run for all projects:
#14299 (comment)

but the log shows:

Step #1 - "Legacy": INFO:root:================================================================
Step #1 - "Legacy": INFO:root:            PHASE 2: STARTING TEST BUILDS
Step #1 - "Legacy": INFO:root:================================================================
Step #1 - "Legacy": INFO:root:Build type: fuzzing
Step #1 - "Legacy": INFO:root:  - Selected projects: 309 / 1323 (due to failed production builds)
Step #1 - "Legacy": INFO:root:  - To build all projects, use the --force-build flag.
Step #1 - "Legacy": INFO:root:Starting to create and trigger builds for build type: fuzzing
Step #3 - "Ubuntu 24.04": INFO:root:Triggered all builds.
Step #3 - "Ubuntu 24.04": INFO:root:================================================================
Step #3 - "Ubuntu 24.04": INFO:root:               PHASE 2: SKIPPED BUILDS
Step #3 - "Ubuntu 24.04": INFO:root:================================================================
Step #3 - "Ubuntu 24.04": INFO:root:Total skipped builds: 1060
Step #3 - "Ubuntu 24.04": INFO:root:--- SKIPPED BUILDS ---
Step #3 - "Ubuntu 24.04": INFO:root:  - fuzzing:
Step #3 - "Ubuntu 24.04": INFO:root:    - abseil-py: Production build succeeded
Step #3 - "Ubuntu 24.04": INFO:root:    - ada-url: Production build succeeded

i.e. all projects that currently build successfully are actually skipped. Note that we primarily want the successful projects to run, because we want to make sure no regressions happen in the various projects.

I assume it's these lines here:

    if (project not in project_statuses or not project_statuses[project] or
        force_build):

where the `not` was added in this PR.
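To make the effect of that condition concrete, here is a small sketch that applies it to a handful of projects. It assumes `project_statuses` maps a project name to `True` when its production build succeeded; that mapping, the function name `select_projects`, and the project names `failing-project` and `new-project` are hypothetical. Only the condition itself is taken from the lines quoted above.

```python
def select_projects(projects, project_statuses, force_build=False):
    """Splits projects into (selected, skipped) using the quoted condition."""
    selected, skipped = [], []
    for project in projects:
        if (project not in project_statuses or
                not project_statuses[project] or force_build):
            selected.append(project)
        else:
            # Known project with a truthy status and no force flag.
            skipped.append(project)
    return selected, skipped


# Assumed meaning: True = production build succeeded.
statuses = {'abseil-py': True, 'ada-url': True, 'failing-project': False}
projects = ['abseil-py', 'ada-url', 'failing-project', 'new-project']

print(select_projects(projects, statuses))
print(select_projects(projects, statuses, force_build=True))
```

Under these assumptions, projects whose production build succeeded are skipped unless `force_build` is set, which matches the "Production build succeeded" entries in the skipped list of the log.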
